arXiv:1503.01820v1 [cs.RO] 6 Mar 2015




Latent Hierarchical Model for Activity Recognition 
Abstract: We present a novel hierarchical model for human
activity recognition. In contrast to approaches that successively 
recognize actions and activities, our approach jointly models 
actions and activities in a unified framework, and their labels 
are simultaneously predicted. The model is embedded with a 
latent layer that is able to capture a richer class of contextual 
information in both state-state and observation-state pairs. Although
loops are present in the model, the model has an overall
linear-chain structure, where the exact inference is tractable. 
Therefore, the model is very efficient in both inference and 
learning. The parameters of the graphical model are learned 
with a Structured Support Vector Machine (Structured-SVM). 
A data-driven approach is used to initialize the latent variables; 
therefore, no manual labeling for the latent states is required. 
The experimental results from using two benchmark datasets 
show that our model outperforms the state-of-the-art approach, 
and our model is computationally more efficient. 

Index Terms —Human activity recognition, RGB-D perception, 
Probabilistic Graphical Models, Personal Robots. 


I. Introduction 

THE use of robots as companions to help people in their
daily life is currently being widely studied. Numerous 
studies have focused on providing people with physical [1], 
cognitive [2] or social [3] support. To achieve this, a fundamental
and necessary task is to recognize human activities.
For example, to decide when to offer physical support, a 
robot needs to recognize that a person is walking. To decide 
whether to remind people to continue drinking, a robot needs 
to recognize past drinking activities. To determine whether a 
person is lonely, a robot needs to detect interactions between 
people. In this paper, we propose a hierarchical approach to 
model human activities. 

Different types of sensors have been applied to the task of 
activity recognition [4], [5]. Kasteren et al. [6] adopt a set of
simple sensors, i.e., pressure, contact, and motion sensors, to
recognize daily activities of people in a smart home. Hu et al.
[7] use a ceiling-mounted color camera to recognize human
postures, and the postures are recognized based on still images. 
Recently, RGB-D sensors, such as the Microsoft Kinect and 
ASUS Xtion Pro, have become popular in activity recognition 
because they can capture 2.5D data using structured light, 
thereby allowing researchers to extract a rich class of depth 
features for activity recognition. In this work, we equip a robot 
with an RGB-D sensor to collect sequences of activity data, 
from which we extract object locations and human skeleton 
points, as shown in Fig. 1. Based on these observations, our
task is to estimate activities as well as sequences of composing
actions.

Fig. 1. An example that shows a robot helping people in an elderly home. (a)
Care-O-bot 3 offers water to the elderly after detecting that the elderly resident
has not drunk any water for a long time. (b) In this work, an RGB-D sensor is
used to recognize human activities. This work is built upon existing methods
in object recognition, object localization, and human skeleton tracking. Object
and human skeleton information are combined as the input of our model to
infer human activities.

We distinguish between activities and actions as follows. 
Actions are the atomic movements of a person that relate to at 
most one object in the environment, e.g., reaching, placing,
opening, and closing. Most of these actions are completed in 
a relatively short period of time. In contrast, activities refer to 
a complete sequence that is composed of different actions. 
For example, microwaving food is an activity that can be 
decomposed into a number of actions such as opening the 
microwave, reaching for food, moving food, placing food, 
and closing the microwave. The relation between actions and 
activities is illustrated in Fig. 2.

Fig. 2. An illustration of the activity hierarchy. The input video is represented
as a spatial-temporal volume. The bottom layer shows a video that is
discretized into multiple temporal segments for modeling, and spatial-temporal
features are extracted from each temporal segment. In the middle layer, actions
are recognized from the input features with one atomic activity per segment. In
the top layer, the activities are described in terms of the sub-level activity
sequence. The undirected links in the graph represent the inter-dependency
between layers. Note that the video segments may not have the same length;
thus, a segmentation method needs to be applied.

Fig. 3. The graphical representation of our model. Nodes that represent the
observations x, which are observed both in training and testing, are rendered
in black. y refers to the action nodes, and A is the corresponding activity label
of the sequence. Both are in gray because they are only observed during
training and not testing. White nodes z refer to the latent variables, which
are unknown in both training and testing. They are used to represent the
hidden sub-level semantics among consecutive actions. Note that $x_k$, $y_k$, and $z_k$
are fully connected in our model, as are the temporal transitions of action-latent
pairs. Therefore, the model enables a richer representation of an activity
hierarchy. $x_0$ represents the set of global features.

The recognition of actions is usually formulated as a sequential
prediction problem [8] (see Fig. 2). In this approach, the
RGB-D video is first divided into smaller video segments so
that each segment contains approximately one action. This can 
be accomplished either by manual annotation or by automated 
temporal segmentation based on motion features. Spatio-temporal
features are extracted for each segment. For real-world
tasks in HRI, it is desirable to recognize activities at a
higher level whereby the activities are usually performed over 
a longer duration. The combination of actions and activities 
forms a sequential model with a hierarchy (Fig. 2).

Most previous work addresses activity and action recognition
as separate tasks [8]-[10], i.e., the action labels need to be
inferred before the activity labels are predicted. In contrast, in 
this paper, we jointly model actions and activities in a unified 
framework, where the activity and action labels are learned 
simultaneously. Our experimental evaluation demonstrates that 
this framework is beneficial when compared to separate recognition.
This can be intuitively understood by considering the
case of learning actions: the activity label provides additional 
constraints to the action labels, which can result in a better 
estimation of the actions, and vice versa. 


Fig. 3 is a graphical representation of our approach. The
proposed model of this paper is based on our previous work 
[11], wherein we recognize the sequence of actions using
Conditional Random Fields (CRFs). The model is augmented 
with a layer of latent nodes to enrich the model’s expressiveness.
For simplicity, we use latent variables to refer to the
variables in the hidden layer, which are unknown both during 
training and testing. Labels, in contrast, are known during
training but are latent during testing. The latent variables are 
able to capture such a difference and are able to model the 
rich variations of the actions. One can imagine that the latent 
variables represent sub-types of the actions: e.g., for the action
opening, we are able to model the difference between opening
a bottle and opening a door using latent variables. 

For each temporal segment, we preserve the full connectivity
among observations, latent variables, and action nodes, thus
avoiding making inappropriate conditional independence assumptions.
We describe an efficient method of applying exact
inference in our graph, whereby collapsing the latent states and 
target states allows our graphical model to be considered as 
a linear-chain structure. Applying exact inference under such 
a structure is very efficient. We use a max-margin approach 
for learning the parameters of the model. Benefiting from the 
discriminative framework, our method need not model the
correlation between the input data, thus providing us with a 
natural way of data fusion. 

The model was evaluated using the RGB-D data from 
two different benchmark datasets [9], [12]. The results are
compared with a number of the state-of-the-art approaches 
©-GD The results show that our model performs better than 
the state-of-the-art approaches, and the model is more efficient 
in terms of inference. 

In summary, the contribution of this paper is a novel Hidden 
CRF model for jointly predicting activities and their sub-level
actions, which outperforms the state of the art both
in terms of predictive performance and in computational 
cost. Our software is open source and freely accessible at 
http://ninghanghu.eu/activity_recognition.html

In this paper, we address the following research questions: 

• How important is it to add an activity hierarchy to the 
model? 

• How important is it to add the latent layer to the model? 

• How important is it to jointly model actions and activities?

• How does our model compare with state-of-the-art approaches?

• How well can the model be generalized to a new problem?


The remainder of the paper is organized as follows. We
describe the related work in Section II. We formalize the
model and present the objective function in Section III. The
inference and learning algorithms are introduced in Section IV
and Section V. We show the implementation details and the
comparison of the results with the state-of-the-art approach in
Section VI.


II. Related Work 

The previous works can be categorized into two methodologies.
The first methodology divides the approaches based
on the hierarchical layout of the model, i.e., whether the
model contains a single layer or multiple layers. The second
methodology is based on the nature of the learning method,
i.e., whether the method is discriminative or generative.

A. Single-layer Approach and Hierarchical Approach 

Human activity recognition is a key component for HRI, 
particularly for the re-ablement of the elderly ED Depending 
on the complexity and duration of activities, activity recognition
approaches can be separated into two categories [14]:
single-layer approaches and hierarchical approaches. Single-layer
approaches ©-[23] refer to methods that are able
to directly recognize human activities from the data without 
defining any activity hierarchy. Usually, these activities are 
both simple and short; therefore, no higher level layers are 
required. Typical activities in this category include walking, 
waiting, falling, jumping and waving. Nevertheless, in the 
real world, activities are not always as simple as these basic 
actions. For example, the activity of preparing breakfast may 
consist of multiple actions such as opening a fridge, getting a 
salad and making coffee. Typical hierarchical approaches 
ED, ED -[[26] first estimate the sub-level actions, and then, 
the high-level activity labels are inferred based on the action 
sequences. 

Sung et al. [12] proposed a hierarchical maximum entropy
Markov model that detects activities from RGB-D videos. 
They consider the actions as hidden nodes that are learned 
implicitly. Recently, Koppula et al. [9] presented an interesting
approach that models both activities and object affordance as 
random variables. The object affordance label is defined as 
the possible manners in which people can interact with an 
object, e.g., reachable, movable, and eatable. These nodes are
inter-connected to model object-object and object-human interactions.
Nodes are connected across the segments to enable
temporal interactions. Given a test video, the model jointly 
estimates both human activities and object affordance labels 
using a graph-cut algorithm. After the actions are recognized, 
the activities are estimated using a multi-class SVM. In this 
paper, we build a hierarchical approach that jointly estimates 
actions and activities from the RGB-D videos. The inference 
algorithm is more efficient compared with graph-cut methods. 

B. Generative Models and Discriminative Models 

Many different graphical models, e.g., Hidden Markov
Models (HMMs) [12], [27], Dynamic Bayesian Networks
(DBNs) [28], linear-chain CRFs [29], loopy CRFs [9], Semi-Markov
Models [6], and Hidden CRFs [30], [31], have been
applied to the recognition of human activities. The graphical 
models can be divided into two categories: generative models 
[12], [27] and discriminative models 0, 0, 0. The generative
models require making assumptions concerning both
the correlation of data and how the data are distributed given 
the activity state. This is risky because the assumptions may 
not reflect the true attributes of the data. The discriminative 
models, in contrast, only focus on modeling the posterior 
probability regardless of how the data are distributed. The 
robotic and smart environment scenarios are usually equipped
with a combination of multiple sensors. Some of these sensors
may be highly correlated both in the temporal and spatial 
domain, e.g., a pressure sensor on a mattress and a motion
sensor above a bed. In these scenarios, the discriminative 
models provide a natural way of data fusion for human activity 
recognition. 

The linear-chain Conditional Random Field (CRF) is one 
of the most popular discriminative models and has been 
used for many applications. Linear-chain CRFs are efficient 
models because the exact inference is tractable. However, 
these models are limited because they cannot capture the 
intermediate structures within the target states [32]. By adding
an extra layer of latent variables, the model allows for more 
flexibility and therefore can be used for modeling more 
complex data. The names of these models, including Hidden-unit
CRF [33], Hidden-state CRF [32] or Hidden CRF ED,
are interchangeable in the literature.

Koppula et al. [9] present a model for the temporal and
spatial interactions between humans and objects in loopy 
CRFs. More specifically, they develop a model that has two 
types of nodes for representing the action labels of the human 
and the object affordance labels of the objects. Human nodes 
and object nodes within the same temporal segment are fully 
connected. Over time, each node is linked to the nodes of the
same type in the adjacent segments. The results show that by modeling the
human-object interaction, their model outperforms the earlier 
work in ED and [34]. The inference in the loopy graph is
solved as a quadratic optimization problem using the graph-cut 
method [35]. Their inference method, however, is less efficient
compared with the exact inference in a linear-chain structure 
because the graph-cut method requires multiple iterations 
before convergence; more iterations are usually preferred to 
ensure that a good solution is obtained. 

Another study [36] augments linear-chain CRFs with an additional
layer of latent variables. They explicitly model the
new latent layer to represent the durations of activities. In
contrast to [9], Tang et al. [36] solve the inference problem
by reforming the graph into a set of cliques so that the exact
inference can be efficiently solved using dynamic programming.
In their model, the latent variables and the observation
are assumed to be conditionally independent given the target 
states. 

Our work is different from the previous approaches in 
terms of both the utilized graphical model and the efficiency 
of inference. First, similar to [36], our model also uses an
extra latent layer. However, instead of explicitly modeling the 
latent variables, we directly learn the latent variables from 
the data. Second, we do not make conditional independence 
assumptions between the latent variables and the observations. 
Instead, we add one extra edge between them to make the local 
graph fully connected. Third, although our graph also presents 
many loops, as in [9], we are able to transform the cyclic
graph into a linear-chain structure wherein the exact inference 
is tractable. The exact inference in our graph only requires two 
passes of messages across the linear chain structure, which is 
substantially more efficient than the method in [9]. Finally, we
model the interaction between the human and the objects at 
the feature level instead of modeling the object affordance 
as target states. Therefore, the parameters are learned and
are directly optimized for activity recognition rather than for 
making the joint estimation of both object affordance and 
human activity. Because we apply a data-driven approach to 
initializing the latent variables, hand labeling of the object 
affordance is not necessary in our model. Our results show 
that the model outperforms the state-of-the-art approaches on 
the CAD-120 dataset [9].

III. Modeling Activity Hierarchy 

The graphical model of our proposed system is illustrated in
Fig. 3. Let $x = \{x_1, x_2, \ldots, x_K \mid x_k \in \mathbb{R}^D\}$ be the sequence
of observations, where $K$ is the total number of temporal
segments in the video. Our goal is to predict the most likely
underlying action sequence $y = \{y_1, y_2, \ldots, y_K \mid y_k \in \mathcal{Y}\}$
and its corresponding activity label $A \in \mathcal{H}$ based on the
observations. We define $x_0$ as the global features that are
extracted from $x$.

Each observation $x_k$ is a feature vector extracted from
segment $k$. The form of $x_k$ is quite flexible: $x_k$ can be
a collection of data from different sources, e.g., simple sensor
readings, human locations, human poses, and object locations.
Some of these observations may be highly correlated with each
other, e.g., wearable accelerometers and motion sensors
would be highly correlated. Because of the discriminative 
nature of our model, we do not need to model such correlation 
among the observations. 

We define $z = \{z_1, z_2, \ldots, z_K \mid z_k \in \mathcal{Z}\}$ to be the latent
variables in the model. The latent variables, which are implicitly
learned from the data, can be considered as modeling
the sub-level semantics of the actions. For clarity, one could
imagine that $y_k = 1$ refers to the action opening. Then, the
joint state $(y_k = 1, z_k = 1)$ can describe opening a microwave,
and $(y_k = 1, z_k = 2)$ can describe opening a bottle. Note
that these two sub-types of opening actions differ greatly in
the observed videos. However, the latent variables allow us to
capture large variations in the same action.

Next, we will formulate our model in terms of these defined 
variables. For simplicity, we assume that there are in total $N_y$
actions and $N_a$ activities to be recognized, and let us define
$N_z$ as the cardinality of the latent variable.

A. Potential Function 

Our model contains five types of potentials that together 
form the potential function. 

The first potential measures the score of making an observation
$x_k$ with a joint-state assignment $(z_k, y_k)$. We define
$\Phi(x_k)$ to be the function that maps the input data into the
feature space; $w$ is a matrix that contains model parameters.

$$\psi_1(y_k, z_k, x_k; w_1) = w_1(y_k, z_k) \cdot \Phi(x_k) \tag{1}$$

where $w_1 \in \mathbb{R}^{\mathcal{Y} \times \mathcal{Z} \times D}$ and $w_1(y_k, z_k)$ is the concatenation of
the parameters that corresponds to $y_k$ and $z_k$.

This potential models the full connectivity among $y_k$, $z_k$ and
$x_k$ and avoids making any conditional independence assumptions.
It is more accurate to have such a structure because $z_k$
and $x_k$ may not be conditionally independent given $y_k$
in many cases. Let us consider the aforementioned example.
Knowing that the action is opening, whether the latent state
refers to opening a microwave or opening a bottle depends on
how the opening action is performed in the observed video,
i.e., the latent state and the observation are inter-dependent
given the action label.

The second potential measures the score of coupling $y_k$ with
$z_k$. The score can be considered as either the bias entry of (1)
or the prior of seeing the joint state $(y_k, z_k)$.

$$\psi_2(y_k, z_k; w_2) = w_2(y_k, z_k) \tag{2}$$

where $w_2$ represents the parameter of the second potential
with $w_2 \in \mathbb{R}^{\mathcal{Y} \times \mathcal{Z}}$.

The third potential characterizes the transition score from
the joint state $(y_{k-1}, z_{k-1})$ to $(y_k, z_k)$. Compared with the
normal transition potentials d), our model leverages the
latent variable $z_k$ for modeling richer contextual information
over consecutive temporal segments. Our model not only contains
the transition between the action states but also captures
the sub-level context using the latent variables.

$$\psi_3(y_{k-1}, z_{k-1}, y_k, z_k; w_3) = w_3(y_{k-1}, z_{k-1}, y_k, z_k) \tag{3}$$

where the potential is parameterized by $w_3 \in \mathbb{R}^{\mathcal{Y} \times \mathcal{Z} \times \mathcal{Y} \times \mathcal{Z}}$.

The fourth potential models the compatibility among consecutive
action pairs and the activity.

$$\psi_4(y_{k-1}, y_k, A; w_4) = w_4(y_{k-1}, y_k, A) \tag{4}$$

where $w_4 \in \mathbb{R}^{\mathcal{Y} \times \mathcal{Y} \times \mathcal{H}}$ and $w_4(y_{k-1}, y_k, A)$ is a scalar that
reflects the compatibility between the transition of an action 
and the activity label. 

The last potential models the compatibility between the
activity label $A$ and the global features $x_0$.

$$\psi_5(A, x_0; w_5) = w_5(A) \cdot \Phi(x_0) \tag{5}$$

where the parameters $w_5 \in \mathbb{R}^{\mathcal{H}}$ can be interpreted as a global
filter that favors certain combinations of $x_0$ and $A$.

Summing all potentials over the entire sequence, we can 
write the final potential function of our model as follows: 

$$\begin{aligned}
F(A, y, z, x; w) ={}& \sum_{k=1}^{K} w_1(y_k, z_k) \cdot \Phi(x_k) + \sum_{k=1}^{K} w_2(y_k, z_k) \\
&+ \sum_{k=2}^{K} w_3(y_{k-1}, z_{k-1}, y_k, z_k) + \sum_{k=2}^{K} w_4(y_{k-1}, y_k, A) \\
&+ w_5(A) \cdot \Phi(x_0)
\end{aligned} \tag{6}$$

The potential function evaluates the matching score between
the joint states $(A, y, z)$ and the input $(x, x_0)$. The score
equals the un-normalized joint probability in the log space.
The objective function can be rewritten in a more general
linear form $F(y, z, x; w) = w \cdot \Psi(y, z, x)$. Therefore, the
model is in the class of log-linear models.
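
As an illustration of Eq. (6), the following minimal sketch evaluates the score of one fully labeled sequence. The nested-list weight layout (`w1[y][z]` and `w5[A]` as vectors, the rest as scalars) and all function and argument names are assumptions made for this example; they are not taken from the released software.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def score(A: int, y: List[int], z: List[int], feats, global_feat,
          w1, w2, w3, w4, w5) -> float:
    """Evaluate F(A, y, z, x; w) of Eq. (6) for one sequence.

    Hypothetical layout: w1[y][z] is a weight vector, w2[y][z] a scalar,
    w3[y_prev][z_prev][y][z] and w4[y_prev][y][A] scalars, w5[A] a vector.
    feats[k] plays the role of Phi(x_k); global_feat is Phi(x_0).
    """
    total = dot(w5[A], global_feat)                        # fifth potential
    for k in range(len(y)):
        total += dot(w1[y[k]][z[k]], feats[k])             # first potential
        total += w2[y[k]][z[k]]                            # second potential
        if k > 0:                                          # sums start at k = 2
            total += w3[y[k - 1]][z[k - 1]][y[k]][z[k]]    # third potential
            total += w4[y[k - 1]][y[k]][A]                 # fourth potential
    return total
```

Because every term is a lookup or an inner product with a weight, the score is linear in the parameters, which is the log-linear form noted above.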

Note that it is not necessary to explicitly model the latent 
variables; rather, the latent variables can be automatically 
learned from the training data. Theoretically, the latent variables
can represent any form of data, e.g., time duration and
action primitives, as long as the data can be used to facilitate
the task. The optimization of the latent model, however, may
converge to a local minimum. The initialization of the latent
variables is therefore of great importance. We compare three
initialization strategies in this paper. Details of the latent
variable initialization will be discussed in Section VI-B1.

One may notice that our graphical model has many loops, 
which in general makes the exact inference intractable. Because
our graph complies with the semi-Markov property, we
will now show how we benefit from such a structure to obtain 
efficient inference and learning. 

IV. Inference 

Given the graph and the parameters, inference is used to 
find the most likely joint state (A, y,z) that maximizes the 
objective function. 

$$(A^*, y^*, z^*) = \arg\max_{(A, y, z) \in \mathcal{H} \times \mathcal{Y} \times \mathcal{Z}} F(A, y, z, x; w) \tag{7}$$

Generally, solving (7) is an NP-hard problem that requires
the evaluation of the objective function over an exponential 
number of state sequences. Exact inference is preferable 
because it is guaranteed to find the global optimum. However, 
the exact inference usually can only be applied efficiently 
when the graph is acyclic. In contrast, approximate inference is 
more suitable for loopy graphs but may take longer to converge 
and is likely to obtain a local optimum. Although our graph 
contains loops, we can transform the graph into a linear-chain 
structure, in which the exact inference becomes tractable. If 
we collapse the latent variable $z_k$ with $y_k$ and $A$ into a single
factor, the edges among $z_k$, $y_k$ and $A$ become the internal
factor of the new node, and the transition edges collapse into
a single transition edge. This results in a typical linear-chain
CRF, where the cardinality of the new nodes is $N_y \times N_z \times N_a$.
In the linear-chain CRF, the exact inference can be efficiently
performed using dynamic programming [37].

Using the chain property, we can write the following recursion
procedure for computing the maximal score over all
possible assignments of $y$, $z$ and $A$.

$$\begin{aligned}
V_k(A, y_k, z_k) ={}& w_1(y_k, z_k) \cdot \Phi(x_k) + w_2(y_k, z_k) \\
&+ \max_{(y_{k-1}, z_{k-1}) \in \mathcal{Y} \times \mathcal{Z}} \{ w_3(y_{k-1}, z_{k-1}, y_k, z_k) \\
&\qquad + w_4(y_{k-1}, y_k, A) + V_{k-1}(A, y_{k-1}, z_{k-1}) \}
\end{aligned} \tag{8}$$

The above function is evaluated iteratively across the entire
sequence. For each iteration, we record the joint state
$(y_{k-1}, z_{k-1})$ that contributes to the max. When the last
segment is computed, the optimal assignment of segment $K$
can be computed as

$$(A^*, y_K^*, z_K^*) = \arg\max_{A, y_K, z_K} V_K(A, y_K, z_K) + w_5(A) \cdot \Phi(x_0) \tag{9}$$

Knowing the optimal assignment at $K$, we can track back
the best assignment in the previous time step $K - 1$. The
process continues until all $y^*$ and $z^*$ have been assigned, i.e.,
the inference problem in (7) is solved.
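
The recursion in Eq. (8), the final maximization in Eq. (9), and the backtracking step can be sketched as follows. This is a minimal pure-Python illustration under assumed data structures (the nested-list weight layout and the name `infer` are hypothetical), not the authors' implementation. The chain over the collapsed states $(y_k, z_k)$ is solved once per activity label $A$.

```python
from itertools import product


def dot(u, v):
    """Plain inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def infer(feats, global_feat, w1, w2, w3, w4, w5):
    """Exact MAP inference following Eqs. (8) and (9).

    Hypothetical layout: w1[y][z] and w5[A] are vectors; w2[y][z],
    w3[y_prev][z_prev][y][z] and w4[y_prev][y][A] are scalars.
    Returns (best_score, A*, y*, z*).
    """
    n_y, n_z, n_a = len(w2), len(w2[0]), len(w5)
    states = list(product(range(n_y), range(n_z)))  # collapsed (y, z) states

    def unary(k, y, z):
        return dot(w1[y][z], feats[k]) + w2[y][z]

    best = None
    for a in range(n_a):
        v = {(y, z): unary(0, y, z) for (y, z) in states}  # V_1
        back = []
        for k in range(1, len(feats)):
            v_new, ptr = {}, {}
            for (y, z) in states:
                def step(p):
                    # term inside the max of Eq. (8) for previous state p
                    return w3[p[0]][p[1]][y][z] + w4[p[0]][y][a] + v[p]
                prev = max(states, key=step)
                v_new[(y, z)] = unary(k, y, z) + step(prev)
                ptr[(y, z)] = prev  # remember the maximizing predecessor
            v = v_new
            back.append(ptr)
        last = max(states, key=lambda s: v[s])
        total = v[last] + dot(w5[a], global_feat)  # Eq. (9)
        if best is None or total > best[0]:
            path = [last]
            for ptr in reversed(back):  # track back from segment K to 1
                path.append(ptr[path[-1]])
            path.reverse()
            best = (total, a, [s[0] for s in path], [s[1] for s in path])
    return best
```

On small problems the result can be checked against an exhaustive search over all $(A, y, z)$, which is a convenient sanity test for any reimplementation.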

Computing (8) once involves $O(N_y N_z)$ computations. In
total, (8) needs to be evaluated for all possible assignments
of $(y_k, z_k, A)$; thus, it is computed $N_y N_z N_a$ times per segment. The total
computational cost is, therefore, $O(N_y^2 N_z^2 N_a K)$. Such computation
is manageable when $N_y N_z$ is not very large, which
is usually the case for the tasks of activity recognition.
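
To make the cost concrete, the short sketch below counts the basic operations of the recursion and compares them with the number of joint assignments an exhaustive search would have to score. The values $N_y = 10$ and $N_a = 10$ match the CAD-120 label sets described in this paper; $N_z = 2$ and $K = 5$ are assumed values chosen only for illustration.

```python
def viterbi_ops(n_y: int, n_z: int, n_a: int, k: int) -> int:
    """Operation count of the recursion: O(N_y^2 N_z^2 N_a K)."""
    per_evaluation = n_y * n_z              # inner max over previous states
    evaluations_per_segment = n_y * n_z * n_a
    return per_evaluation * evaluations_per_segment * k


def exhaustive_assignments(n_y: int, n_z: int, n_a: int, k: int) -> int:
    """Number of joint assignments (A, y, z) for a sequence of k segments."""
    return n_a * (n_y * n_z) ** k


# N_z = 2 and K = 5 are assumptions for this example.
print(viterbi_ops(10, 2, 10, 5))             # 20000
print(exhaustive_assignments(10, 2, 10, 5))  # 32000000
```

The dynamic program grows linearly with the number of segments, whereas the exhaustive count grows exponentially.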

Next, we show how we can learn the parameters using the 
max-margin approach. 

V. Learning

We use the max-margin approach for learning the parameters
in our graphical model. Given a set of $N$ training examples,
$(x^{(n)}, y^{(n)}, A^{(n)})$ $(n = 1, 2, \ldots, N)$, we would like to learn
the model parameters $w$ that can produce the activity label $A$
and action labels $y$ given a new test input $x$. Note that both
activities and action labels are observed during training. The 
latent variables z are unobserved and will be automatically 
inferred from the training process. 

The goal of learning is to find the optimal model parameters 
w that minimize the objective function. A regularization term 
is used to avoid over-fitting. 

$$\min_w \left\{ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \Delta(y^{(i)}, \hat{y}, A^{(i)}, \hat{A}) \right\} \tag{10}$$

where C is a normalization constant that is used to provide a 
balance between the model complexity and fitting rate. 

The loss function $\Delta(y^{(i)}, \hat{y}, A^{(i)}, \hat{A})$ measures the cost of
making incorrect predictions. $\hat{y}$ and $\hat{A}$ are the most likely
action and activity labels that are computed from (7). The loss
function in (11) returns zero when the prediction is exactly the
same as the ground truth; otherwise, it counts the number of
disagreed elements.

$$\Delta(y^{(i)}, \hat{y}, A^{(i)}, \hat{A}) = \lambda \, \mathbb{1}(A^{(i)} \neq \hat{A}) + \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}(y_k^{(i)} \neq \hat{y}_k) \tag{11}$$

where $\mathbb{1}(\cdot)$ is an indicator function and $0 \le \lambda \le 1$ is a scalar
weight that balances between the two loss terms.
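
The loss in Eq. (11) can be sketched in a few lines. The function name, argument names, and the default value of `lam` are assumptions for this example; the paper does not state which value of the weight was used.

```python
from typing import Sequence


def loss(y_true: Sequence[int], y_pred: Sequence[int],
         a_true: int, a_pred: int, lam: float = 0.5) -> float:
    """Loss of Eq. (11): weighted activity error plus mean action error.

    lam is the scalar weight on the activity term; 0.5 is an arbitrary
    default chosen for illustration only.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("action sequences must have the same length")
    action_error = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
    activity_error = float(a_true != a_pred)
    return lam * activity_error + action_error


# Two of four actions wrong and the activity wrong: 0.5 * 1 + 2 / 4
print(loss([0, 1, 2, 3], [0, 1, 0, 0], a_true=5, a_pred=4))  # 1.0
```

With `lam=0.0` the activity term vanishes and only incorrectly predicted actions are counted.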

This objective function can be viewed as a generalized form of
our previous work [11], where we recognize only the sequence
of actions. This can be done by simply setting $\lambda$ to
0 and leaving the graphical structure unchanged. The learning 
framework then only tracks the incorrectly predicted actions, 
regardless of the activities. 

Directly optimizing (10) is not possible because the loss
function involves computing the argmax in (7). Following [38]
and [39], we substitute the loss function in (10) by the margin
rescaling surrogate, which serves as an upper-bound of the
loss function.

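$$\begin{aligned}
\min_w \Big\{ & \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \max_{A, y, z} \left[ \Delta(y^{(i)}, y, A^{(i)}, A) + F(x^{(i)}, y, A, z, w) \right] \\
& - C \sum_{i=1}^{n} \max_{z} F(x^{(i)}, y^{(i)}, A^{(i)}, z, w) \Big\}
\end{aligned} \tag{12}$$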

The second term in (12) can be solved using the augmented
inference, i.e., by plugging in the loss function as an extra
factor in the graph, the term can be solved in the same way
as the inference problem using (7). Similarly, the third term
of (12) can be solved by adding $y^{(i)}$ and $A^{(i)}$ as the evidence
into the graph and then applying inference using (7). Because
the exact inference is tractable in our graphical model, both
of the terms can be computed very efficiently.
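
As an illustration of the third term of (12), the sketch below finds the best latent sequence when the action labels are clamped to the ground truth. With $y$ and $A$ fixed, the $w_4$ and $w_5$ terms of Eq. (6) are constant in $z$ and can be dropped, so a chain over $z$ alone remains. The weight layout and the name `complete_latent` are assumptions for this example.

```python
def dot(u, v):
    """Plain inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def complete_latent(feats, y, w1, w2, w3):
    """argmax over z of F with the action labels y clamped.

    Hypothetical layout: w1[y][z] is a vector; w2[y][z] and
    w3[y_prev][z_prev][y][z] are scalars. Terms that do not depend on z
    (w4 and w5) are omitted because they do not change the argmax.
    """
    n_z = len(w2[0])
    v = [dot(w1[y[0]][z], feats[0]) + w2[y[0]][z] for z in range(n_z)]
    back = []
    for k in range(1, len(y)):
        v_new, ptr = [], []
        for z in range(n_z):
            def step(p):
                # best score ending in latent state p, plus the transition
                return v[p] + w3[y[k - 1]][p][y[k]][z]
            zp = max(range(n_z), key=step)
            v_new.append(step(zp) + dot(w1[y[k]][z], feats[k]) + w2[y[k]][z])
            ptr.append(zp)
        v = v_new
        back.append(ptr)
    z_path = [max(range(n_z), key=lambda z: v[z])]
    for ptr in reversed(back):  # track back to the first segment
        z_path.append(ptr[z_path[-1]])
    z_path.reverse()
    return z_path
```

The second term of (12) can be handled analogously by the full inference routine, since the per-segment part of the loss in (11) only adds a constant to the unary score of each action label and the activity part only adds a constant per activity label.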

Note that (12) is the sum of a convex and a concave
function. This can be solved with the Concave-Convex Procedure
(CCCP) [40]. By substituting the concave function
with its tangent hyperplane function, which serves as an
upper bound of the concave function, the concave term is
transformed into a linear function. Thus, (12) becomes convex
again.

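We can rewrite (12) in the form of minimizing a function
subject to a set of constraints by adding slack variables $\xi_i$:

$$\begin{aligned}
& \min_{w, \xi} \left\{ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \right\} \\
& \text{s.t.} \quad \forall i \in \{1, 2, \ldots, n\},\ \forall y \in \mathcal{Y}: \\
& F(x^{(i)}, y^{(i)}, A^{(i)}, z^*, w) - F(x^{(i)}, y, A, z, w) \ge \Delta(y^{(i)}, y, A^{(i)}, A) - \xi_i
\end{aligned} \tag{13}$$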

where $z^*$ denotes the most likely latent states that are inferred given
the training data.

Note that there are an exponential number of constraints in
(13). This can be solved using the cutting-plane method [41].

Another intuitive method of understanding the CCCP algorithm
is to consider the algorithm as one that solves the
learning problem with incomplete data using Expectation-Maximization
(EM) [42]. In our training data, the latent
variables are unknown. We can start by initializing the latent 
variables. Once we have the latent variables, the data become 
complete. Then, we can use the standard Structured-SVM to 
learn the model parameters (M-step). Subsequently, we can 
update the latent states again using the parameters that are 
learned (E-step). The iteration continues until convergence. 
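
The alternation described above can be written as a generic skeleton. The sketch below is not the authors' training code: `init_latent`, `fit_params`, and `infer_latent` are hypothetical callables standing in for the latent initialization, the Structured-SVM solver, and the latent-state inference, respectively, and the stopping rule (unchanged latent states) is an assumption.

```python
from typing import Any, Callable, List, Sequence, Tuple


def cccp(examples: Sequence[Any],
         init_latent: Callable[[Any], Any],
         fit_params: Callable[[Sequence[Any], List[Any]], Any],
         infer_latent: Callable[[Any, Any], Any],
         max_iter: int = 20) -> Tuple[Any, List[Any]]:
    """Alternate between parameter learning and latent-state completion.

    fit_params(examples, z) solves the convex problem with the latent
    states fixed (the M-step); infer_latent(example, w) re-estimates the
    latent states under the current parameters (the E-step).
    """
    z = [init_latent(ex) for ex in examples]
    w = None
    for _ in range(max_iter):
        w = fit_params(examples, z)                        # M-step
        z_new = [infer_latent(ex, w) for ex in examples]   # E-step
        if z_new == z:  # latent states stopped changing
            break
        z = z_new
    return w, z
```

Because each step can only converge to a local optimum of the overall objective, the result depends on `init_latent`, which is why the initialization strategy matters.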

The CCCP algorithm decreases the objective function in 
each iteration. However, the algorithm cannot guarantee a 
global optimum. To avoid being trapped in a local minimum, 
we present three different initialization strategies, and details 
will be presented in Section VI-B1.

Note that the inference algorithm is extensively used in 
learning. Because we are able to compute the exact inference 
by transforming the loopy graph into a linear-chain graph, our 
learning algorithm is much faster and more accurate compared 
with the other approaches with approximate inference. 


VI. Experiments and Results 

We implemented the proposed model, denoted as full model, 
along with its three variations. Specifically, the first model recognizes
only low-level actions, the second model recognizes
high-level activities, and the third model recognizes activities 
based on actions. All models were evaluated on two different 
datasets. The results from the different models were compared 
to gain insight into our research questions (in Section I). The
full model is also shown to outperform the state-of-the-art 
methods. 


A. Datasets 

The methods were evaluated on two benchmark datasets, i.e.,
CAD-60 [12] and CAD-120 [9]. Both of the datasets contain
sequences of color and depth images that were collected by
an RGB-D sensor. Skeleton joints of the person are obtained
using OpenNI.

The two datasets are quite different from each other; 
therefore, they can be used to test the generalizability of 
our methods. The CAD-60 dataset consists of 12 human 
action labels and no activity labels. The actions include 
rinsing mouth, brushing teeth, wearing contact lens, talking 
on the phone, drinking water, opening pill container, cooking 
(chopping), cooking (stirring), talking on couch, relaxing on 
couch, writing on white board, and working on computer. 
These actions are performed by 4 different subjects in 5 
different environments, i.e., a kitchen, a bedroom, a bathroom, 
a living room, and an office. In total, the dataset includes 
approximately 60 videos, and each video contains one action 
label. In contrast, the CAD-120 dataset [9] contains 126 RGB-D 
videos, and each video contains one activity and a sequence 
of actions. There are in total 10 activities defined in the dataset, 
including making cereal, taking medicine, stacking objects, 
unstacking objects, microwaving food, picking up objects, 
cleaning objects, taking food, arranging objects, and having a 
meal. Fig. 4 shows various sample images of these activities. 
In addition, the dataset also contains 10 sub-level actions, 
i.e., reaching, moving, pouring, eating, drinking, opening, 
placing, closing, scrubbing, and null. The objects in CAD-120 
are automatically detected as in 0, and the locations of the 
objects are also provided by the dataset. 

The two datasets are very challenging in the following 
aspects. a) The activities in the dataset are performed by four 
different actors. The actors behave quite differently, e.g., in 
terms of being left or right handed, being viewed from a front 
view or side view, and sitting or standing. b) There is a large 
variation even for the same action, e.g., the action opening 
can refer to opening a bottle or opening the microwave. 
Although both actions have the same label, they appear 
significantly different from each other in the video. c) Partial 
or full occlusion is also a very challenging aspect for this 
dataset, e.g., in certain videos, the actors' legs are completely 
occluded by the table, and objects are frequently occluded by 
other objects. This makes it difficult to obtain accurate 
object locations as well as body skeletons; therefore, the 
generated data are noisy. 

A number of recent approaches E-m have been evaluated 
on these two datasets; therefore, the results can be 
directly compared. To ensure a fair comparison, the same 
input features are extracted following [9]. Specifically, we 
have object features $\phi_o(x_k) \in \mathbb{R}^{180}$, object-object interaction 
features $\phi_{oo}(x_k) \in \mathbb{R}^{200}$, object-subject relation features 
$\phi_{oa}(x_k) \in \mathbb{R}^{400}$, and the temporal object and subject features 
$\phi_t(x_k) \in \mathbb{R}^{200}$. For CAD-120, we extract the complete set 
of features from the object locations, which are provided 
by the dataset. For the CAD-60 dataset, only skeletal features 
are extracted because there is no object information. These 
features are concatenated into a single feature vector, which 
is considered as the observation of one action segment, i.e., 
$\Phi(x_k)$. 

1 http://structure.io/openni 

[Fig. 4 panel labels: Picking Objects, Stacking Objects, Taking Food, Taking Medicine, Unstacking Objects] 

Fig. 4. Sample images from the CAD-120 dataset. The images illustrate 10 different activities. 
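As a small illustration of the concatenation step for CAD-120, the sketch below joins the four feature groups into one observation vector per action segment. The group sizes (180, 200, 400, 200) are taken from the text; the dictionary keys and the function name `concat_features` are hypothetical, and the zero vectors are placeholders rather than real features.

```python
# Feature group sizes as stated in the text; key names are illustrative only.
FEATURE_DIMS = {
    "object": 180,
    "object_object": 200,
    "object_subject": 400,
    "temporal": 200,
}


def concat_features(parts):
    """Concatenate per-segment feature groups in a fixed order."""
    vector = []
    for name, dim in FEATURE_DIMS.items():
        block = list(parts[name])
        if len(block) != dim:
            raise ValueError(f"{name}: expected {dim} values, got {len(block)}")
        vector.extend(block)
    return vector


# Placeholder input: zero-valued features for a single action segment.
segment = {name: [0.0] * dim for name, dim in FEATURE_DIMS.items()}
observation = concat_features(segment)
print(len(observation))  # 980 = 180 + 200 + 400 + 200
```

Keeping the group order fixed matters because the learned weights are tied to positions in the concatenated vector.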

B. Implemented Models 

In this section, we describe the three baseline models in 
detail, followed by the introduction of the full model. Note that 
all of the models were first evaluated on the CAD-120 dataset. 
To test the generalizability of the model, the same experiments 
are repeated using the CAD-60 dataset. Note that the CAD-60 
dataset contains no activity labels, but it has additional labels to 
indicate the environments. In our experiments, we treat these 
additional labels as if they were activities; thus, the model 
structure is left unchanged compared with the experiments on 
CAD-120. The only difference is that we jointly model the 
actions with the environments instead of the activities. 

1) Recognize Only Low-level Actions: The first model is 
adopted from our previous work GT which predicts action 
labels based on the video sequence. This model is a single-layer 
approach that only contains the low-level layer of nodes. 
By setting the weight A to zero, the model focuses only on 
predicting correct action labels regardless of the activity label; 
therefore, the model can be considered as a special case of the 
full model. The parameters of this model are learned with the 
Structured-SVM framework [38]. We use the margin rescaling 
surrogate as the loss. For optimization, we use the 1-slack 
algorithm (primal), as described in [43]. 

To initialize the latent states on CAD-120, we adopt three 
different initialization strategies. a) Random initialization. The 
latent states are randomly selected. b) A data-driven approach. 
We apply clustering on the input data x. The number of 
clusters is set to be the same as the number of latent states. We 
run K-means 10 times. Then, we choose the best clustering 
result based on minimal within-cluster distances. The labels 
of the clusters are assigned as the initial latent states. c) 
Initialization by object affordance. The object affordance labels 
are provided by the CAD-120 dataset and are used for 
training in [9]. We apply K-means clustering to the 
affordance labels. As the affordance labels are categorical, 
we use 1-of-N encoding to transform the affordance labels 
into binary values for clustering. The CAD-60 dataset does not 
contain affordance labels; therefore, the latent variables are 
initialized only with the data-driven approach. 
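The data-driven and affordance-based initializations can be sketched as below. This is a minimal pure-Python illustration, not the authors' code: the function names, the fixed random seed, the iteration cap and the example affordance names are assumptions. It runs K-means several times, keeps the run with the smallest total within-cluster squared distance, and uses the cluster indices as initial latent states; `one_hot` shows the 1-of-N encoding applied to categorical labels before clustering.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]


def kmeans_once(points, k, rng, n_iter=100):
    """One K-means run; returns (labels, within-cluster cost)."""
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    cost = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, cost


def init_latent_states(points, n_latent, n_restarts=10, seed=0):
    """Keep the restart with the smallest within-cluster cost."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, n_latent, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[1])[0]


def one_hot(labels, categories):
    """1-of-N encoding of categorical labels."""
    index = {c: i for i, c in enumerate(categories)}
    return [[1.0 if index[l] == i else 0.0 for i in range(len(categories))]
            for l in labels]
```

For the affordance-based strategy, the categorical labels of a segment would first pass through `one_hot` and the resulting binary vectors would then be clustered with `init_latent_states`.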

2) Recognize Only High-level Activities: The second model 
contains only a single layer for recognizing activities, i.e., 
we disregard the layer of actions; instead, we learn a direct 
mapping from video features to activity labels. Similar to the 
first model, the parameters are learned with the Structured- 
SVM, but the model contains no transition. 

3) Recognize Activities Based on Action Sequences: This 
approach is built upon the first baseline. Based on the inferred 
action labels, we learn a model to classify activities. We extract 
unigram and bigram features based on the action sequence 
as well as the occlusion features. The model parameters are 
estimated with a variation of multi-class SVM, where the latent 
layer is augmented in the model. In this approach, the actions 
and activities are recognized in succession. 
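A minimal sketch of unigram and bigram features over an inferred action sequence is given below. It assumes raw counts over the 10 CAD-120 action labels; the text does not state whether counts are normalized, the occlusion features are omitted, and the function name `ngram_features` is hypothetical.

```python
# The 10 sub-level action labels listed for CAD-120.
ACTIONS = ["reaching", "moving", "pouring", "eating", "drinking",
           "opening", "placing", "closing", "scrubbing", "null"]


def ngram_features(action_seq, actions=ACTIONS):
    """Return unigram counts followed by bigram (transition) counts."""
    idx = {a: i for i, a in enumerate(actions)}
    n = len(actions)
    unigram = [0] * n
    bigram = [0] * (n * n)
    for a in action_seq:
        unigram[idx[a]] += 1
    # Count each pair of consecutive actions.
    for a, b in zip(action_seq, action_seq[1:]):
        bigram[idx[a] * n + idx[b]] += 1
    return unigram + bigram


features = ngram_features(["reaching", "moving", "pouring", "moving", "placing"])
print(len(features))  # 110 = 10 unigram + 100 bigram entries
```

Such a vector would then be passed to the activity classifier; the bigram part is what lets the classifier see the order of actions rather than only their frequencies.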

4) Joint Estimation of Activity and Actions using Hierarchical 
Approach (full model): This approach refers to the proposed 
model of the paper. Instead of successively recognizing 
actions and activities, our model uses a hierarchical framework 
to make joint predictions over both activity and action labels. 

We compare two different segmentation methods on the 
videos in the CAD-120 dataset. In the first method, we use the 
ground truth segmentation, which is manually annotated. In 
the second method, we apply a motion-based approach, 
i.e., we extract the spatial-temporal features for all the frames, 
and similar frames are grouped together using a graph-based 
approach to form segments. For CAD-60, we apply uniform 
segmentation as in [0 to enable a fair comparison with other 
methods. 
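Uniform segmentation, as used for CAD-60, can be sketched in a few lines. The segment length is not given in this section, so `segment_len` below is a hypothetical parameter, as is the function name; the last segment is simply shorter when the frame count is not a multiple of the length.

```python
def uniform_segments(n_frames, segment_len):
    """Split frame indices 0..n_frames-1 into consecutive [start, end) segments."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [(start, min(start + segment_len, n_frames))
            for start in range(0, n_frames, segment_len)]


print(uniform_segments(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```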

The above methods were evaluated on both the CAD-120 
and CAD-60 datasets. Because the two datasets are quite 
different from each other, they can be used to test how the 
results can be generalized to new data. The performance of 
these methods on both datasets is reported in Section VI-D. 


C. Evaluation Criteria 

Our model was evaluated with 4-fold cross-validation. The 
folds are split based on the 4 subjects. To choose the 
hyperparameters, i.e., the number of latent states and segmentation 
methods, we used two subjects for training, one subject for 
validation and one subject for testing. Once the optimal 
hyperparameters are chosen, the performance of the model during 
testing is measured by another cross-validation process, i.e., 
training using videos of 3 persons and testing on a new 
person. Each cross-validation is performed 3 times. To observe 
the generalization of our model across different data sets, the 
results are averaged across the folds. In this paper, the accuracy 
(classification rate), precision, recall and F-score are reported 
to enable a comparison of the results. In the CAD-120 dataset, 
more than half of the instances are reaching and moving. 
Therefore, we consider precision and recall to be relatively 
better evaluation criteria than accuracy because they remain 
meaningful despite class imbalance. 
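The evaluation protocol can be sketched as follows. This is a minimal illustration, assuming macro-averaging over classes and an F-score computed as the harmonic mean of the averaged precision and recall; the text does not spell out the averaging scheme, and the function names are hypothetical.

```python
def leave_one_subject_out(subjects):
    """Yield (train_subjects, test_subject), one fold per subject."""
    for test in subjects:
        yield [s for s in subjects if s != test], test


def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, F-score)."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # A class that is never predicted (or never occurs) scores 0 here.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / len(labels)
    recall = sum(recalls) / len(labels)
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, precision, recall, f_score


# Imbalanced toy example: always predicting the majority class.
y_true = ["reaching"] * 8 + ["pouring"] * 2
y_pred = ["reaching"] * 10
accuracy, precision, recall, f_score = macro_scores(y_true, y_pred)
```

In the toy example the accuracy is 0.8 although the minority class is never recognized, while the macro precision and recall drop to 0.4 and 0.5, which is the effect the class-imbalance argument above refers to.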


D. Results and Analysis 

In this section, we report the experimental results and 
compare the performance of different models. Table I shows 
the performance of all the models during testing on the CAD-120 
dataset. The performance of both action and activity 
recognition is reported. For comparison, the results of both 
ground-truth segmentation and motion-based segmentation are 
reported. Table II shows the performance during testing on the 
CAD-60 dataset. 

Next, we analyze the results while referring to our research 
questions posed in Section I. 

Importance of hierarchical model. In Table I, Single Layer 
refers to the second baseline approach, wherein we learn a 
direct mapping from video-level features to activity labels. 
There is no intermediate layer of labels. The Single Layer 
approach achieves an average performance of over 70% in 
both segmentation methods but with a large standard error 
of approximately 5%. In contrast, the hierarchical 
approaches outperform the Single Layer approach by at least 10 
percentage points when using ground-truth segmentation and 5 
percentage points when using motion-based segmentation. By 
incorporating the layer of action labels, we can see significant 
improvements in terms of recognizing activities. Therefore, 
temporal information, such as transitions between actions, is 
a very important aspect of activity recognition. 

Table II shows the results using the CAD-60 dataset under 
similar experiment settings; however, the goal here is to predict 
actions together with the environment. We can see that the 
F-score of the environment prediction is increased by over 
11 percentage points when using the hierarchical approach 
(full model), which is significantly better than the single-layer 
approach. The hierarchical approaches also exhibit significant 
improvements in terms of precision and recall. The increase in 
the mean is over 6 percentage points, with a reduced standard 
error. 

Importance of embedding the latent layer. To demonstrate 
the importance of using latent variables, we compare 
the proposed model (full model) to the model without 
latent variables (no latent). Table I shows that the 
full model outperforms no latent in terms of recognition of 
both actions and activities. Notably, after adding the latent 
variable, the precision and recall for activity are increased by 
over 4 and 5 percentage points, respectively, using ground-truth 
segmentation. When using motion-based segmentation, 
the performance of the full model for activity is increased by 10 
percentage points in terms of precision and 6 percentage points 
in terms of recall. The improvement is significant after using 
latent variables. Note that the no latent model is a special case 
of the full model, i.e., no latent is equivalent to the full model 
when there is only one latent state. Here, we list these models 
separately to illustrate the effect of using multiple latent states. 
In contrast, Table II only shows the performance of the full 
model because the model starts overfitting the data when more 
than one latent state is used, i.e., no latent 
(latent=1) achieves the best performance. From this, we can 
see that the model is quite flexible and that it can be used to 
fit data with varying levels of complexity by simply adjusting 
the number of latent states in the model. 

Importance of jointly modeling activity and action. Hu 
et al. [8], [11] in Table I refers to a combination of the first and 
third baseline approaches, where we used a two-step approach 
to successively recognize actions and activities. This method 
shows significant improvement over the Single layer approach. 
However, this two-step approach is significantly outperformed by our 
proposed hierarchical method (full model) using both segmentation 
methods. Notably, for activity recognition, the F-score 
is increased by 3 percentage points using our proposed 
model, with an increase of 4 percentage points in terms of 
precision and 6 percentage points in terms of recall. For 
action recognition, the performance gain in terms of F-score is 
approximately 3.7 percentage points and includes significant 
improvements in both precision and recall. This is because 
the full model allows the interaction between the low-level 
and high-level layers during both learning and inference, and 
labels in the hierarchy are jointly estimated when making 
predictions. Similar results were found using the CAD-60 
dataset (see Table II). We note that the performance is largely 
increased when using the full model. The F-score is increased 
by 3 percentage points for predicting action and environment 
labels. 

Comparison with the state-of-the-art approaches. The 
proposed method was evaluated on both the CAD-60 and 
CAD-120 datasets to provide a comparison with the 
state-of-the-art methods. 

To be comparable with the other approaches, following 0, 
ED, we conduct similar experiments on the CAD-60 dataset, 
where we group the actions based on their environment labels 
and a separate model is learned and tested for each of the 
groups. The results of these experiments are reported in 
Table III. We note that our model outperforms m in all 
five environments. Compared with the state of the art [9], our 
model outperforms ED and [9] on most of the environments. 
On average, the precision of our model is the same as in 
Q, and the recall of the model outperforms [9] by over 8 
percentage points, achieving 80.8% for precision and 80.1% 
for recall. The average F-Score is over 4 percentage points 
better than in (5). 


TABLE I 

Performance of Activity and Action Recognition during testing on the CAD-120 Dataset. The results are reported in terms of 
Accuracy, Precision, Recall and F-Score (%). The standard error is also reported. 

Ground Truth Segmentation 

Methods                 Action                                          Activity 
                        Accuracy    Precision   Recall      F-Score     Accuracy    Precision   Recall      F-Score 
Single layer            -           -           -           -           74.2 ± 5.1  78.5 ± 4.7  73.3 ± 5.1  75.8 ± 4.9 
Koppula et al. [9]      86.0 ± 0.9  84.2 ± 1.3  76.9 ± 2.6  80.4 ± 1.7  84.7 ± 2.4  85.3 ± 2.0  84.2 ± 2.5  84.7 ± 2.2 
Koppula et al. [10]     89.3 ± 0.9  87.9 ± 1.8  84.9 ± 1.5  86.4 ± 1.6  93.5 ± 3.0  95.0 ± 2.3  93.3 ± 3.1  94.1 ± 2.6 
Hu et al. [8], [11]     87.0 ± 0.9  89.2 ± 2.3  83.1 ± 1.2  85.5 ± 1.6  90.0 ± 2.9  92.8 ± 2.3  89.7 ± 3.0  91.2 ± 2.5 
Our Model (no latent)   87.2 ± 0.8  87.4 ± 1.5  85.0 ± 1.4  86.2 ± 1.4  87.9 ± 1.5  91.9 ± 0.7  87.5 ± 1.6  89.7 ± 1.0 
Our Model (full)        89.7 ± 0.6  90.2 ± 0.7  88.2 ± 0.6  89.2 ± 0.6  93.6 ± 2.7  95.2 ± 2.0  93.3 ± 2.8  94.2 ± 2.3 

Motion-based Segmentation 

Methods                 Action                                          Activity 
                        Accuracy    Precision   Recall      F-Score     Accuracy    Precision   Recall      F-Score 
Single layer            -           -           -           -           75.0 ± 5.3  79.0 ± 4.9  74.2 ± 5.5  76.5 ± 5.2 
Koppula et al. [9]      68.2 ± 0.3  71.1 ± 1.9  62.2 ± 4.1  66.4 ± 2.6  80.6 ± 1.1  81.8 ± 2.2  80.0 ± 1.2  80.9 ± 1.6 
Koppula et al. [10]     70.3 ± 0.6  74.8 ± 1.6  66.2 ± 3.4  70.2 ± 2.2  83.1 ± 3.0  87.0 ± 3.6  82.7 ± 3.1  84.8 ± 3.3 
Hu et al. [8], [11]     70.0 ± 0.3  70.3 ± 0.5  67.8 ± 0.2  69.0 ± 0.3  79.0 ± 6.2  86.4 ± 4.9  78.8 ± 5.9  82.4 ± 4.4 
Our Model (no latent)   67.1 ± 0.4  69.1 ± 1.2  65.6 ± 1.5  67.3 ± 1.8  79.0 ± 2.0  80.4 ± 2.7  78.5 ± 2.0  79.4 ± 2.3 
Our Model (full)        70.2 ± 1.2  71.1 ± 1.8  69.9 ± 1.9  70.5 ± 1.9  85.2 ± 1.4  90.3 ± 1.9  84.7 ± 1.5  87.4 ± 1.7 


TABLE II 

Test Performance on the CAD-60 dataset with Uniform Segmentation (%). The standard error is also reported. 

Methods                 Action                                          Environment 
                        Accuracy    Precision   Recall      F-Score     Accuracy    Precision   Recall      F-Score 
Single layer            -           -           -           -           50.0 ± 2.8  63.0 ± 2.5  52.8 ± 2.2  57.5 ± 2.3 
Hu et al. [8], [11]     66.5 ± 4.3  71.1 ± 2.6  67.1 ± 3.7  67.7 ± 3.4  60.0 ± 1.5  71.0 ± 2.5  62.1 ± 1.8  63.0 ± 2.1 
Our Model (full)        74.4 ± 4.0  80.3 ± 4.4  81.0 ± 1.6  80.7 ± 2.9  60.6 ± 0.5  74.7 ± 2.7  62.5 ± 1.0  68.6 ± 1.9 


Table I compares the performance of different approaches 
on the CAD-120 dataset. Similar to [8], Koppula et al. [10] 
use a two-step approach to infer high-level activity labels 
only after the actions are estimated. Benefiting from the joint 
estimation of action and activity, our full model outperforms 
the state-of-the-art models in terms of both action and activity 
recognition tasks. Notably, using ground-truth segmentation, 
the F-score is improved by approximately 4 percentage points 
for recognizing actions. Based on motion segmentation, the 
activity recognition performance is improved by over 2 
percentage points in terms of F-Score. 
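As a quick consistency check on Table I, the snippet below recomputes the F-score of the full model from its reported mean precision and recall, assuming the F-score is their harmonic mean (an assumption; the text does not state the formula). The input values are copied from Table I, and the names `f_score` and `FULL_MODEL` are illustrative.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) for "Our Model (full)" in Table I.
FULL_MODEL = {
    "ground truth / action": (90.2, 88.2, 89.2),
    "ground truth / activity": (95.2, 93.3, 94.2),
    "motion-based / action": (71.1, 69.9, 70.5),
    "motion-based / activity": (90.3, 84.7, 87.4),
}

for setting, (p, r, reported) in FULL_MODEL.items():
    print(setting, round(f_score(p, r), 1), reported)
```

Under this assumption the recomputed values agree with the reported F-scores to one decimal place for all four settings.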

Fig. 5 shows the confusion matrices of both the action and 
activity classification results. The most difficult action class is 
scrubbing. This action is sometimes confused with reaching and 
placing. The overall performance of the activity recognition is 
very good, with most of the activities being correctly classified. 
The more difficult case is to distinguish between “stacking 
objects” and “arranging objects”. Overall, we can see that 
high values are found on the diagonal using both segmentation 
methods, which demonstrates the good performance of our 
system. 

VII. Conclusion 

In this paper, we present a hierarchical approach that 
simultaneously recognizes actions and activities based on 
RGB-D data. The interactions between actions and activities 
are captured by a Hidden-state CRF framework. In this framework, 
we use the latent variables to exploit the underlying 
structures of actions. The prediction is based on the joint 
interaction between activities and actions, which is in contrast 
to the traditional approach, which only focuses on one of 
them. Our results show a significant improvement when using 
the hierarchical model compared to using the single-layered 
approach. The results also demonstrate the effectiveness of 
adding a latent layer to the model and the importance of jointly 
estimating actions and activities. Finally, we show that the 
proposed hierarchical approach outperforms the state-of-the- 
art methods on two benchmark datasets. 
TABLE III 